160 research outputs found
Transfer from Multiple MDPs
Transfer reinforcement learning (RL) methods leverage on the experience
collected on a set of source tasks to speed-up RL algorithms. A simple and
effective approach is to transfer samples from source tasks and include them
into the training set used to solve a given target task. In this paper, we
investigate the theoretical properties of this transfer method and we introduce
novel algorithms adapting the transfer process on the basis of the similarity
between source and target tasks. Finally, we report illustrative experimental
results in a continuous chain problem.Comment: 201
Experimental results : Reinforcement Learning of POMDPs using Spectral Methods
We propose a new reinforcement learning algorithm for partially observable
Markov decision processes (POMDP) based on spectral decomposition methods.
While spectral methods have been previously employed for consistent learning of
(passive) latent variable models such as hidden Markov models, POMDPs are more
challenging since the learner interacts with the environment and possibly
changes the future observations in the process. We devise a learning algorithm
running through epochs, in each epoch we employ spectral techniques to learn
the POMDP parameters from a trajectory generated by a fixed policy. At the end
of the epoch, an optimization oracle returns the optimal memoryless planning
policy which maximizes the expected reward based on the estimated POMDP model.
We prove an order-optimal regret bound with respect to the optimal memoryless
policy and efficient scaling with respect to the dimensionality of observation
and action spaces.Comment: 30th Conference on Neural Information Processing Systems (NIPS 2016),
Barcelona, Spai
Best-Arm Identification in Linear Bandits
We study the best-arm identification problem in linear bandit, where the
rewards of the arms depend linearly on an unknown parameter and the
objective is to return the arm with the largest reward. We characterize the
complexity of the problem and introduce sample allocation strategies that pull
arms to identify the best arm with a fixed confidence, while minimizing the
sample budget. In particular, we show the importance of exploiting the global
linear structure to improve the estimate of the reward of near-optimal arms. We
analyze the proposed strategies and compare their empirical performance.
Finally, as a by-product of our analysis, we point out the connection to the
-optimality criterion used in optimal experimental design.Comment: In Advances in Neural Information Processing Systems 27 (NIPS), 201
Active Learning for Accurate Estimation of Linear Models
We explore the sequential decision making problem where the goal is to
estimate uniformly well a number of linear models, given a shared budget of
random contexts independently sampled from a known distribution. The decision
maker must query one of the linear models for each incoming context, and
receives an observation corrupted by noise levels that are unknown, and depend
on the model instance. We present Trace-UCB, an adaptive allocation algorithm
that learns the noise levels while balancing contexts accordingly across the
different linear functions, and derive guarantees for simple regret in both
expectation and high-probability. Finally, we extend the algorithm and its
guarantees to high dimensional settings, where the number of linear models
times the dimension of the contextual space is higher than the total budget of
samples. Simulations with real data suggest that Trace-UCB is remarkably
robust, outperforming a number of baselines even when its assumptions are
violated.Comment: 37 pages, 8 figures, International Conference on Machine Learning,
ICML 201
Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes
While designing the state space of an MDP, it is common to include states
that are transient or not reachable by any policy (e.g., in mountain car, the
product space of speed and position contains configurations that are not
physically reachable). This leads to defining weakly-communicating or
multi-chain MDPs. In this paper, we introduce \tucrl, the first algorithm able
to perform efficient exploration-exploitation in any finite Markov Decision
Process (MDP) without requiring any form of prior knowledge. In particular, for
any MDP with communicating states, actions and
possible communicating next states,
we derive a regret bound, where is the diameter
(i.e., the longest shortest path) of the communicating part of the MDP. This is
in contrast with optimistic algorithms (e.g., UCRL, Optimistic PSRL) that
suffer linear regret in weakly-communicating MDPs, as well as posterior
sampling or regularised algorithms (e.g., REGAL), which require prior knowledge
on the bias span of the optimal policy to bias the exploration to achieve
sub-linear regret. We also prove that in weakly-communicating MDPs, no
algorithm can ever achieve a logarithmic growth of the regret without first
suffering a linear regret for a number of steps that is exponential in the
parameters of the MDP. Finally, we report numerical simulations supporting our
theoretical findings and showing how TUCRL overcomes the limitations of the
state-of-the-art
Second-Order Kernel Online Convex Optimization with Adaptive Sketching
Kernel online convex optimization (KOCO) is a framework combining the
expressiveness of non-parametric kernel models with the regret guarantees of
online learning. First-order KOCO methods such as functional gradient descent
require only time and space per iteration, and, when the only
information on the losses is their convexity, achieve a minimax optimal
regret. Nonetheless, many common losses in kernel
problems, such as squared loss, logistic loss, and squared hinge loss posses
stronger curvature that can be exploited. In this case, second-order KOCO
methods achieve regret, which
we show scales as , where
is the effective dimension of the problem and is usually much smaller than
. The main drawback of second-order methods is their
much higher space and time complexity. In this paper, we
introduce kernel online Newton step (KONS), a new second-order KOCO method that
also achieves regret. To address the
computational complexity of second-order methods, we introduce a new matrix
sketching algorithm for the kernel matrix , and show that for
a chosen parameter our Sketched-KONS reduces the space and time
complexity by a factor of to space and
time per iteration, while incurring only times more regret
Sequential Transfer in Multi-armed Bandit with Finite Set of Models
Learning from prior tasks and transferring that experience to improve future
performance is critical for building lifelong learning agents. Although results
in supervised and reinforcement learning show that transfer may significantly
improve the learning performance, most of the literature on transfer is focused
on batch learning tasks. In this paper we study the problem of
\textit{sequential transfer in online learning}, notably in the multi-armed
bandit framework, where the objective is to minimize the cumulative regret over
a sequence of tasks by incrementally transferring knowledge from prior tasks.
We introduce a novel bandit algorithm based on a method-of-moments approach for
the estimation of the possible tasks and derive regret bounds for it
Learning with stochastic inputs and adversarial outputs
International audienceMost of the research in online learning is focused either on the problem of adversarial classification (i.e., both inputs and labels are arbitrarily chosen by an adversary) or on the traditional supervised learning problem in which samples are independent and identically distributed according to a stationary probability distribution. Nonetheless, in a number of domains the relationship between inputs and outputs may be adversarial, whereas input instances are i.i.d. from a stationary distribution (e.g., user preferences). This scenario can be formalized as a learning problem with stochastic inputs and adversarial outputs. In this paper, we introduce this novel stochastic-adversarial learning setting and we analyze its learnability. In particular, we show that in a binary classification problem over an horizon of rounds, given a hypothesis space with finite VC-dimension, it is possible to design an algorithm that incrementally builds a suitable finite set of hypotheses from used as input for an exponentially weighted forecaster and achieves a cumulative regret of order O(\sqrt{n VC\mathscr{H} log n})$ with overwhelming probability. This result shows that whenever inputs are i.i.d. it is possible to solve any binary classification problem using a finite VC-dimension hypothesis space with a sub-linear regret independently from the way labels are generated (either stochastic or adversarial). We also discuss extensions to multi-class classification, regression, learning from experts and bandit settings with stochastic side information, and application to games
Truthful Learning Mechanisms for Multi-Slot Sponsored Search Auctions with Externalities
Sponsored search auctions constitute one of the most successful applications
of microeconomic mechanisms. In mechanism design, auctions are usually designed
to incentivize advertisers to bid their truthful valuations and to assure both
the advertisers and the auctioneer a non-negative utility. Nonetheless, in
sponsored search auctions, the click-through-rates (CTRs) of the advertisers
are often unknown to the auctioneer and thus standard truthful mechanisms
cannot be directly applied and must be paired with an effective learning
algorithm for the estimation of the CTRs. This introduces the critical problem
of designing a learning mechanism able to estimate the CTRs at the same time as
implementing a truthful mechanism with a revenue loss as small as possible
compared to an optimal mechanism designed with the true CTRs. Previous work
showed that, when dominant-strategy truthfulness is adopted, in single-slot
auctions the problem can be solved using suitable exploration-exploitation
mechanisms able to achieve a per-step regret (over the auctioneer's revenue) of
order (where T is the number of times the auction is repeated).
It is also known that, when truthfulness in expectation is adopted, a per-step
regret (over the social welfare) of order can be obtained. In
this paper we extend the results known in the literature to the case of
multi-slot auctions. In this case, a model of the user is needed to
characterize how the advertisers' valuations change over the slots. We adopt
the cascade model that is the most famous model in the literature for sponsored
search auctions. We prove a number of novel upper bounds and lower bounds both
on the auctioneer's revenue loss and social welfare w.r.t. to the VCG auction
and we report numerical simulations investigating the accuracy of the bounds in
predicting the dependency of the regret on the auction parameters
- …